ready signal for cluster announce and publish #175

Closed
wants to merge 1 commit into from

Conversation

antmat
Contributor

@antmat antmat commented Mar 23, 2015

I think announcing the locator right away, the first time, is a better approach. It can also improve startup speed, and it is needed by the upcoming unicorn-based discovery.

@kobolog
Member

kobolog commented Mar 23, 2015

Why?

@kobolog
Member

kobolog commented Mar 23, 2015

I've made my point wrt service startup ordering and dependencies in the previous PR. What's yours? What problem is it going to resolve?

@antmat
Contributor Author

antmat commented Mar 23, 2015

Because we don't really need to reannounce in the unicorn case, but the timer-based scheme forces us to. Besides, which timer interval is best? 1s? 60s? I think it's better to announce right after context bootstrap.

@antmat
Contributor Author

antmat commented Mar 23, 2015

In fact, this PR does not add any dependencies or ordering. It just adds the ability to be notified when context bootstrap finishes, which is the best time to announce services.
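
For illustration, a minimal sketch of the idea, assuming a toy synchronous signal type; bootstrap_signal_t and announce() are placeholders, not the actual cocaine-core API:

#include <functional>
#include <iostream>
#include <vector>

// Illustrative stand-in for the context's "bootstrap finished" signal.
struct bootstrap_signal_t {
    std::vector<std::function<void()>> slots;
    void connect(std::function<void()> slot) { slots.push_back(std::move(slot)); }
    void operator()() const { for (const auto& slot : slots) slot(); }
};

struct cluster_t {
    explicit cluster_t(bootstrap_signal_t& ready) {
        // Announce exactly once, when the context reports that bootstrap is
        // done, instead of reannouncing on a periodic timer.
        ready.connect([this] { announce(); });
    }
    void announce() { std::cout << "publishing locator endpoints\n"; }
};

int main() {
    bootstrap_signal_t ready;
    cluster_t cluster(ready); // construction connects the slot
    ready();                  // bootstrap finished: fires the announce
}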

@kobolog
Member

kobolog commented Mar 23, 2015

Are you trying to say that once written, the announcement information will always stay in ZK? This is not true.

@kobolog
Member

kobolog commented Mar 23, 2015

Also this IS ordering, specifically one form of it called checkpointing.

@antmat
Contributor Author

antmat commented Mar 23, 2015

What would happen with information in zookeeper?

@antmat
Contributor Author

antmat commented Mar 23, 2015

And what exactly is harmful, in your view, about this kind of checkpointing? It does not introduce new dependencies, since cluster_t already depends on the locator. OK, another option is to pass an instance of the locator to the api::cluster_t constructor, as it is genuinely required for cluster_t's functionality. But IMHO a signal is a more general approach.

@kobolog
Member

kobolog commented Mar 23, 2015

You don't need a signal to make it work. You have to use a timer to make it fault-tolerant.

@antmat
Contributor Author

antmat commented Mar 23, 2015

Fault tolerance is guaranteed by the ZooKeeper watch. If the announce information disappears or the connection is reset, we reannounce. We don't need to do it on a timer.
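
A sketch of this watch-driven logic against the ZooKeeper C client API; reannounce() and the ensemble addresses are hypothetical stand-ins for the plugin's actual announce path:

#include <zookeeper/zookeeper.h>
#include <cstdio>
#include <unistd.h>

// Hypothetical stand-in for the plugin's actual announce path.
static void reannounce() { std::printf("rewriting the announce node\n"); }

// Global watcher, registered via zookeeper_init(); ZooKeeper calls it for
// session events and for events on watched nodes.
static void watcher(zhandle_t*, int type, int state, const char*, void*) {
    if (type == ZOO_SESSION_EVENT && state == ZOO_CONNECTED_STATE) {
        reannounce(); // connection (re)established: write the announce data again
    } else if (type == ZOO_DELETED_EVENT || type == ZOO_CHANGED_EVENT) {
        reannounce(); // the announce node was deleted or changed: restore it
    }
}

int main() {
    // Hypothetical ensemble addresses; the watcher above handles the rest.
    zhandle_t* zh = zookeeper_init("zk1:2181,zk2:2181,zk3:2181",
                                   watcher, 10000, nullptr, nullptr, 0);
    if (zh == nullptr) return 1;
    pause(); // keep the session alive; events arrive on ZK's own threads
}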

@antmat
Contributor Author

antmat commented Mar 23, 2015

From my point of view, a timer is more expensive and looks more like a crutch than a signal, or even than passing the locator to cluster_t's constructor.

@oktocat

oktocat commented Mar 23, 2015

What would happen with information in zookeeper?

Anything. It could be lost, it could be corrupted, it could be deleted, it could never actually be written in there at all. You cannot rely on it for sure.

@antmat
Contributor Author

antmat commented Mar 23, 2015

ZooKeeper makes certain guarantees. If we cannot rely on those guarantees, we cannot use ZooKeeper at all. It guarantees that if I issue a write command and it succeeds, the data is there. It also guarantees that if anything happens to the connection OR to the data in that node, a watch will be triggered.

@oktocat

oktocat commented Mar 23, 2015

And there are no bugs in ZK, yes. And the network is completely reliable, and disks are 100% redundant. And there are butterflies shitting rainbows in the sky instead of a snow blizzard at the end of March.

@antmat
Contributor Author

antmat commented Mar 23, 2015

That's not about reliability. The point is: if a problem occurs with ZK, a timer will not help with the situation (there is a problem with ZK, yes). If there is no problem with ZK, the situation is handled properly without a timer.

@kobolog
Member

kobolog commented Mar 23, 2015

So, first of all, I don't like signals in general. They were used here to solve one thing, and apparently the amount of mess they introduce is much worse than their benefits. There's a task in Trello to replace them for the next release: https://trello.com/c/5kOeRUxo. The mess I'm talking about is mainly the synchronization requirements signals impose on the code; for example, in your Unicorn cluster implementation you have a data race in https://github.com/cocaine/cocaine-plugins/pull/57/files#diff-8305a92f02ba42856c2a1ea2414d4323R242, and probably in other parts because of that. I doubt you have ever seen it happen, because it's in a pessimistic code path, but still.
Next, it is a bit naive to think that well of any distributed system, as you do with ZK. What happens if the watch trigger message never arrives at the node? Are you 100% sure that ZK has 'at least once' semantics for their delivery? What if the underlying disks get corrupted? Triggering another write on a timer might expose such a failure.
And last, submitting the same PR again won't really raise its chances of being accepted.

@kobolog
Member

kobolog commented Mar 23, 2015

So I'm not going to accept it, again.

@kobolog kobolog closed this Mar 23, 2015
@antmat
Contributor Author

antmat commented Mar 23, 2015

Just to clarify: I believed the previous PR was not accepted only because of the strict ordering of service startup, not because of the ready signal.
And about ZK - https://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#ch_zkWatches

  • "What happens if the watch trigger message never arrives at the node?" - On any connection loss we will recieve a session event and make reannounce. If we are unable to announce - we will retry by timer.
  • "Are you 100% sure that ZK has 'at least once' semantics for their delivery?" - Yes they guarantee that, but even if it was not so any watch triggers reannounce so it would be ok.
  • "What if underlying disks get corrupted?" - ZK is meant to be set up as a cluster and manage RAFT consensus protocol. Anyway in this case timer would not help us.

Summarizing.

  • If we rely on ZooKeeper, I believe we can use it without timers and reannounces. If we don't, a timer won't help. Also, the non-timer approach reduces the time the cocaine cluster spends in an inconsistent state after a node failure (the node is down but has not been dropped yet).
  • I see now that the signal implementation is a no-go, for an important reason. How do you feel about adding a locator parameter to the cluster_t constructor? Do you agree that the cluster depends on the locator and we cannot do anything about that (at least at the moment)?

@kobolog
Member

kobolog commented Mar 23, 2015

It doesn't follow from the link you've posted that dropped watch notification packets will be resent. Note that I'm not talking about connection loss. Have you run tests against such cases, using tc for example? I've seen multiple times, here at Spotify, that watch notifications were not delivered to the clients. It might be an issue with the client we use, but the bottom line is that it's not a strong guarantee, because of bugs and a generally unreliable transport. In such cases, timers help ensure that even if the ZK information is no longer consistent with the actual cluster state (because some notification was lost, an underlying disk got corrupted, or whatever else), we WILL try to make it consistent every so often. OTOH, there's nothing wrong with the fact that you rely on the guarantees stated in the ZK docs, since we can base our SLAs on them.
As for the last point: there's already a strong guarantee that the locator service is available when the cluster manager instance is being constructed, so there's little use for an additional constructor parameter; BUT since the cluster manager is constructed in the locator's constructor, there's NO strong guarantee that the locator's internal state is fully constructed.
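
For concreteness, the periodic reconciliation described above might look roughly like this in boost::asio; the 60-second interval and the announce body are illustrative, not the actual locator code:

#include <boost/asio.hpp>
#include <functional>
#include <iostream>

int main() {
    boost::asio::io_service loop;
    boost::asio::deadline_timer timer(loop, boost::posix_time::seconds(0));

    // std::function so the handler can re-register itself after each tick.
    std::function<void(const boost::system::error_code&)> tick =
        [&](const boost::system::error_code& ec) {
            if (ec) return;
            std::cout << "rewriting the announce record\n"; // stands in for the ZK write
            timer.expires_from_now(boost::posix_time::seconds(60));
            timer.async_wait(tick);
        };

    timer.async_wait(tick);
    loop.run(); // reconciles immediately, then every 60 seconds, forever
}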

@antmat
Contributor Author

antmat commented Mar 23, 2015

"...There is one case where a watch may be missed: a watch for the existance of a znode not yet created will be missed if the znode is created and deleted while disconnected..." implies that in our case all watches will be delivered (or at least connection loss watch will be triggered on client - that is also ok). I will make several more tests tomorrow to check if this is true.
I can accept your point about "better to care more than care less" but I don't like timers because

  • They generate additional requests to ZooKeeper. With, say, a 200-host cloud and, say, a 1s timer, we get 400 rps of additional requests. That's not critical, but it is a noticeable overhead.
  • If we use watches AND timers, code complexity will increase.
  • If we use only timers, we lose the benefit of near-zero-delay notifications and have to choose between a high request rate to ZK and failed requests when a node goes offline but has not yet been removed from the metalocator.

@kobolog
Member

kobolog commented Mar 23, 2015

There's no implicative relationship between those two statements, only assumptions. Timers are not the best solution, obviously, you're right, but disregarding possible ZK failures is not a good thing to do either, and these failures DO happen. Whether or not the ZK client can work around them is something you can find out via testing.

I'm not saying that ZK watches should be removed; obviously they are a very good way for us to eliminate unnecessary polling and excessive traffic. However, the argument about zero-delay notifications does not have much ground, since we have never seen a use case where this became an issue. Actually, I see a potential regression here: for example, network flapping would cause ephemeral ZK nodes to appear and disappear (if you use them, I don't know), which in turn would trigger flapping in the metalocator's internal cluster representation and flood the kernel with IPVS requests.

Anyway, I can accept this PR if you rework it in such a way as to avoid the possible data races emerging from the use of synchronous signals, e.g. by posting an action from the signal handler, and if you can actually accumulate enough data to prove that general ZK weirdness won't affect metalocator consistency (which is probably for @oktocat to decide).

P.S. ZK is based on ZAB, not RAFT.

@kobolog
Member

kobolog commented Mar 24, 2015

BTW, the easiest way to make signals async-safe would be to boost::asio::io_service::wrap() the handlers. At least, that's what I'm going to do as the first step to resolve this signal synchronization mess.
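
A minimal sketch of that wrap() approach; the handler body is illustrative:

#include <boost/asio.hpp>
#include <iostream>

int main() {
    boost::asio::io_service loop;

    // wrap() returns a forwarding functor: invoking it dispatches the original
    // handler through `loop`, so it runs on a thread executing loop.run()
    // rather than inline on the caller's (e.g. the context's) thread.
    auto on_ready = loop.wrap([] {
        std::cout << "announce services on the chamber's own thread\n";
    });

    on_ready(); // nothing is running the loop yet, so this only enqueues
    loop.run(); // the handler actually executes here
}

One caveat: the wrapped handler uses dispatch(), which may still run the body inline if the invoking thread happens to be running the target io_service; in the signal case the context thread is not, so the handler is queued.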

@antmat
Contributor Author

antmat commented Mar 24, 2015

Can you please explain which data race you are talking about in the unicorn case? As I see it, we connect the slot from the unicorn_cluster constructor and invoke the signal later, when construction is finished. As for concurrently connecting a slot and invoking the signal, I think it's enough to put both the invocation and connect() under a mutex, isn't it? https://github.com/cocaine/cocaine-core/blob/v0.12/include/cocaine/context/signal.hpp#L69. In general this will not affect performance, since they rarely overlap.
Looks like I'm missing something here.
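
A rough illustration of the mutex idea, not the actual signal.hpp code:

#include <functional>
#include <mutex>
#include <vector>

class guarded_signal_t {
    std::mutex mutex_;
    std::vector<std::function<void()>> slots_;

public:
    // connect() and invoke() serialize on the same lock, so a slot can never
    // be added while the signal is being dispatched, and vice versa.
    void connect(std::function<void()> slot) {
        std::lock_guard<std::mutex> guard(mutex_);
        slots_.push_back(std::move(slot));
    }

    void invoke() {
        std::lock_guard<std::mutex> guard(mutex_);
        for (const auto& slot : slots_) {
            slot();
        }
    }
};

Note that this only serializes connect() against invoke(); the slots still run on whichever thread fires the signal, which is the thread-affinity problem discussed below.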

@antmat
Contributor Author

antmat commented Mar 24, 2015

JFYI:

#!/bin/bash
# IP1, IP2 and IP3 are expected in the environment and should hold the
# addresses of the ZooKeeper ensemble nodes. Run with -f to only flush
# the rules back and re-enable ZK.
echo "Flushing rules"
iptables -F INPUT
iptables -F OUTPUT
ip6tables -F INPUT
ip6tables -F OUTPUT

if [ "$1" != "-f" ]
then
        echo "Disabling ZK"
        # Drop all traffic to and from each ZK node to simulate a partition.
        ip6tables -A INPUT -s "$IP1" -j DROP
        ip6tables -A OUTPUT -d "$IP1" -j DROP

        ip6tables -A INPUT -s "$IP2" -j DROP
        ip6tables -A OUTPUT -d "$IP2" -j DROP

        ip6tables -A INPUT -s "$IP3" -j DROP
        ip6tables -A OUTPUT -d "$IP3" -j DROP
fi

I've tested with this script. After the firewall rules are flushed, a reannounce occurs. Flapping via this script in a loop also leads to a reannounce once the firewall is disabled.

antmat added a commit to antmat/cocaine-plugins that referenced this pull request Mar 24, 2015
…s finally discussed and merged or rejected.
@kobolog
Member

kobolog commented Mar 24, 2015

I'm talking about boost::asio thread safety. It's generally unsafe to use the same object from multiple threads, with the exception of boost::asio::io_service. In your code, the cluster manager's timers are set up from the signal handler (which is invoked from the context's thread), while the locator's reactor, which owns those timers, is already running in another thread. While right now I'm not sure whether you use these timers or other boost::asio primitives concurrently from the locator's thread before the signal (I never looked into your ZK client, which might do it), it might easily happen in the future, and then we'll get a subtle, hard-to-detect data race.

See, that's exactly what I'm talking about w.r.t. the signal concurrency mess, and why adding a new signal is not the best idea.

@antmat
Contributor Author

antmat commented Mar 25, 2015

I still cannot understand. We have multiple chambers (execution units), and all async handlers CAN run simultaneously in different threads, so ANY access to asio objects in particular, and to other thread-unsafe objects in general, should be synchronized or guaranteed to execute in one thread. This is not related to signals or anything else: if code can run from different threads at the same time, it should be thread-safe.

@kobolog
Member

kobolog commented Mar 25, 2015

For asio chambers (engine & service), we use a share-nothing architecture. The only thing engine chambers share is service dispatches, which are synchronized. But signals introduce an additional sharing point between threads: signal handlers are invoked from the context thread, but access objects owned by the chambers' threads. So, for signal handlers, either additional synchronization or asynchronous dispatch should be used to avoid data races.

Let me explain in greater detail: given a running service, it accepts new connections in the service chamber and then dispatches them to one of the engine chambers. From this point on, the connection is owned by the engine chamber and shares nothing with anything else in the process, except its service dispatch, which is already synchronized on the message dispatch path and should be synchronized inside the actual service by the service author. If you connect to a signal from such a service, the signal handler will be invoked from a completely unrelated thread, and at the same time it will access previously unshared objects from the service or engine chamber (in your case, the service chamber, where you do timers and other asio work for the cluster manager). These objects are not synchronized, hence the race.
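
A sketch of the safe pattern this implies, posting from the signal handler onto the chamber's io_service; the cluster_manager type here is illustrative, not the actual plugin code:

#include <boost/asio.hpp>

struct cluster_manager {
    boost::asio::io_service& chamber; // the locator's reactor
    boost::asio::deadline_timer timer;

    explicit cluster_manager(boost::asio::io_service& loop)
        : chamber(loop), timer(loop) {}

    // Invoked synchronously on the context thread when the signal fires.
    void on_context_ready() {
        // Never touch the timer here; hand the work over to the chamber's
        // own thread instead, preserving the share-nothing invariant.
        chamber.post([this] {
            timer.expires_from_now(boost::posix_time::seconds(1));
            timer.async_wait([](const boost::system::error_code& ec) {
                if (!ec) { /* announce services here */ }
            });
        });
    }
};

int main() {
    boost::asio::io_service loop;
    cluster_manager manager(loop);
    manager.on_context_ready(); // as if fired from the context thread
    loop.run();                 // the posted closure and the timer run here
}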

@kobolog
Member

kobolog commented Mar 25, 2015

I'm working on an asynchronous signal implementation right now, which will solve this problem once and for all, but RIGHT NOW it's not safe to simply connect signals and use them.

@kobolog
Member

kobolog commented Mar 25, 2015

#177

@kobolog kobolog closed this Mar 27, 2015
@antmat antmat deleted the ready_signal branch March 31, 2015 10:28